Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable BGP when running test_neighbor_mac_noptf #3369

Merged
merged 4 commits into from
May 10, 2021

Conversation

alexrallen
Copy link
Contributor

@alexrallen alexrallen commented Apr 20, 2021

Description of PR

I disable BGP during the execution of test_neighbor_mac_noptf to ensure the switch (swss) is not overloaded by BGP related route updates during the test as a result of the neighbor updates which may cause the test not to complete within the prescribed time of 2 seconds and thus fail.

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Approach

What is the motivation for this PR?

We are occasionally seeing test_neighbor_mac_noptf fail on Mellanox MSN3800 and MSN4600 switches where the neighbor updates in the following code trigger a large number of BGP updates (~8000 lines in swss.rec) and it seems the third neighbor IP update in the code below never gets scheduled or recognized.


    @pytest.fixture(autouse=True)
    def updateNeighborIp(self, duthosts, enum_rand_one_per_hwsku_frontend_hostname, enum_frontend_asic_index, routedInterfaces, ipVersion, verifyOrchagentPresence):
        """
            Update Neighbor/Interface IP
            Prepares the DUT for testing by adding IP to the test interface, add and update
            the neighbor MAC 2 times.
            Args:
                duthost (AnsibleHost): Device Under Test (DUT)
                routedInterface (Fixture<str>): test Interface name
                ipVersion (Fixture<int>): IP version
                verifyOrchagentPresence (Fixture): Make sure orchagent is running before and
                    after update takes place
            Returns:
                None
        """
        duthost = duthosts[enum_rand_one_per_hwsku_frontend_hostname]
        asichost = duthost.asic_instance(enum_frontend_asic_index)
        routedInterface = routedInterfaces[asichost.asic_index]
        self.__updateInterfaceIp(asichost, routedInterface, ipVersion, action="add")
        self.__updateNeighborIp(asichost, routedInterface, ipVersion, 0, action="add")
        self.__updateNeighborIp(asichost, routedInterface, ipVersion, 0, action="change")
        self.__updateNeighborIp(asichost, routedInterface, ipVersion, 1, action="change")

        time.sleep(2)

        yield

        self.__updateNeighborIp(asichost, routedInterface, ipVersion, 1, action="del")
        self.__updateInterfaceIp(asichost, routedInterface, ipVersion, action="remove")

I believe there is a significant likelihood that due to the behavior of the test, that the 3rd neighbor update is never polled for because swss is too busy with the massive number of route updates resulting from the prior updates.

Also, in the test, if it does not successfully register the update within two seconds, it automatically tears down the test possibly removing any evidence that the switch would have eventually processed the command once it completed all the route updates.

Finally, this seems to be very similar to a bug that was filed previously and it was determined that route updates were causing the issue similar to what we see here.

See sonic-net/sonic-buildimage#2414

How did you do it?

Edited the configuration test fixture method in the test to disable BGP on setup and enables it again on teardown. This is done similarly to other tests in sonic-mgmt.

Note: The 120 second delay seems to be necessary to allow all routes and neighbors to flush after BGP comes down.

How did you verify/test it?

Test was manually run against a Mellanox MSN2700, however I was unable to reproduce the original issue during testing and thus have not determined if this fix is sufficient for that specific failure case.

@alexrallen alexrallen requested a review from a team as a code owner April 20, 2021 12:08
@dgsudharsan
Copy link
Contributor

@yxieca Can you please review this?

@liat-grozovik
Copy link
Collaborator

something strange, i cannot set reviewer. is this by purpose?

@liat-grozovik
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@liat-grozovik liat-grozovik merged commit a6ee181 into sonic-net:master May 10, 2021
vmittal-msft pushed a commit to vmittal-msft/sonic-mgmt that referenced this pull request Sep 28, 2021
Disabled BGP during the execution of test_neighbor_mac_noptf to ensure the switch (swss) is not overloaded by BGP related route updates during the test as a result of the neighbor updates which may cause the test not to complete within the prescribed time of 2 seconds and thus fail.

Also, in the test, if it does not successfully register the update within two seconds, it automatically tears down the test possibly removing any evidence that the switch would have eventually processed the command once it completed all the route updates.

- How did you do it?
Edited the configuration test fixture method in the test to disable BGP on setup and enables it again on teardown. This is done similarly to other tests in sonic-mgmt.

Note: The 120 second delay seems to be necessary to allow all routes and neighbors to flush after BGP comes down.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants